Introduction to Open Data Science - Course Project

About the project


Feeling a little bewildered. Lots to take in. I had forgotten the syntax of R; it has been years since I last used it. The intro through the book was thorough, but reading is not the same as hands-on when it comes to coding. At least the book will serve as a good reference in the future.

Chapter 2: Regression and model validation: a case study

This week we dug into regression analysis and model validation. Both very useful and interesting, and, I have to say, somewhat foreign to me. Too long has passed since I last dabbled in statistics. Much reading had to be done to grasp the concepts well enough. Overall I think this week gave me a good basic understanding of how to conduct regression analysis with RStudio.

date()
## [1] "Mon Nov 28 22:52:24 2022"

This week’s dataset comes from Kimmo Vehkalahti. It contains data he collected in a survey conducted among university students on an introductory course in the social sciences. The data was collected between Dec. 2014 and Jan. 2015. The survey covered 183 students, and it includes 56 questionnaire variables, one compound variable (attitude), and three contextualizing variables (age, gender, points in the final exam), bringing the total to 60 columns.

This data was cleaned and made uniform as follows:

  • The survey measured three different approaches to learning, represented by answers in the corresponding columns. These answers were merged into three compound columns, one per approach: deep, surface (surf) and strategic (stra). The value of each new column is the mean of the answers belonging to that category (original answers were on a scale of 1 to 5).
  • The “attitude” compound column was scaled to the same 1-5 range.
  • The now superfluous raw survey columns were removed, leaving 7 columns: gender, age, attitude, deep, stra, surf and points.
  • Rows with a value of zero in the “points” column (17 in total) were removed, leaving 166 rows. Since the aim was to measure the impact of study attitudes and learning approaches on exam success, rows without an exam result were redundant.
  • As a minor curation step, the column names were normalized.
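The wrangling itself was done in a separate script; below is a minimal, self-contained sketch of the same logic. The raw question-column names here (D1, SU1, ST1, …) are placeholders, not the actual survey variable names, and dividing Attitude by 10 assumes the compound was a sum of ten items.

```r
library(dplyr)

# Toy stand-in for the raw survey data (placeholder question names)
raw <- tibble(
  gender = c("F", "M"), Age = c(21, 25), Attitude = c(37, 25),
  D1 = c(4, 3),  D2 = c(5, 2),    # deep-approach items
  SU1 = c(2, 4), SU2 = c(3, 3),   # surface-approach items
  ST1 = c(4, 1), ST2 = c(3, 2),   # strategic-approach items
  Points = c(22, 0)
)

learning2014 <- raw %>%
  mutate(
    deep = rowMeans(across(c(D1, D2))),     # mean of the deep items
    surf = rowMeans(across(c(SU1, SU2))),   # mean of the surface items
    stra = rowMeans(across(c(ST1, ST2))),   # mean of the strategic items
    attitude = Attitude / 10                # rescale the compound to 1-5
  ) %>%
  rename(age = Age, points = Points) %>%
  select(gender, age, attitude, deep, stra, surf, points) %>%
  filter(points > 0)                        # drop rows with no exam result
```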

In the following code the .csv file containing the cleaned data is read into the variable “learning2014”, and its dimensions (dim) and structure (str) are checked to make sure everything is as it should be.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
setwd("data")
learning2014 <- read_csv("learning2014.csv")
## Rows: 166 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): age, attitude, deep, stra, surf, points
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(learning2014)
## [1] 166   7
str(learning2014)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : chr [1:166] "F" "M" "F" "M" ...
##  $ age     : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   attitude = col_double(),
##   ..   deep = col_double(),
##   ..   stra = col_double(),
##   ..   surf = col_double(),
##   ..   points = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

The data and its seven main variables can be summarized as follows:

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

summary(learning2014)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

The data includes more female (110) than male participants, and skews young (median age 22 years). The strongest correlations with exam points are found for attitude and the “strategic” approach to learning. Male participants show somewhat higher attitude scores. Age correlates best with the “strategic” approach. The three variables “deep”, “stra” and “attitude” show positive interrelations. We will choose these three as explanatory variables, with points as the target variable.

my_model <- lm(points ~ attitude + stra + deep, data = learning2014)
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + deep, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5239  -3.4276   0.5474   3.8220  11.5112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.3915     3.4077   3.343  0.00103 ** 
## attitude      3.5254     0.5683   6.203 4.44e-09 ***
## stra          0.9621     0.5367   1.793  0.07489 .  
## deep         -0.7492     0.7507  -0.998  0.31974    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared:  0.2097, Adjusted R-squared:  0.195 
## F-statistic: 14.33 on 3 and 162 DF,  p-value: 2.521e-08

This summary shows the significance of each explanatory variable in relation to the target variable. First it lists the residuals, i.e. the differences between the values predicted by the model and the actual values. Of the coefficient statistics, the t value and its p value (Pr(>|t|)) are the most informative. Looking at them, it seems that the variable “deep” does not correlate with points (Pr(>|t|) = 0.31974) and is statistically non-significant. Thus we will leave it out of the equation and recalculate with two explanatory variables.

my_model2 <- lm(points ~ attitude + stra, data = learning2014)
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

A slightly better result. Attitude seems to be the most significant single factor explaining the exam result (p = 6.31e-09). Of the different approaches to studying, “strategic” bears the most significance; its p-value of 0.089 verges on statistical significance, but does not quite reach it. The adjusted R-squared value of 0.1951 tells us that c. 19.5% of the variation in exam points can be explained by attitude and adherence to the strategic approach to learning, with attitude accounting for almost all of this effect. The model’s p-value is 7.734e-09, highly significant, and the F-statistic of 20.99 is considerably larger than the critical F on 2 and 163 degrees of freedom (~3.05). Therefore we can reject the null hypothesis.
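The critical F-value quoted above can be checked with R’s F quantile function, here at the conventional 5% significance level:

```r
# Critical F at the 5% level for 2 and 163 degrees of freedom
qf(0.95, df1 = 2, df2 = 163)  # approximately 3.05
```

Since the observed F-statistic (20.99) far exceeds this, the model as a whole is clearly significant.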

What remains are the regression diagnostics, shown below in three plots.

par(mfrow = c(2,2))
plot(my_model2, c(1, 2, 5))

The Residuals vs Fitted plot confirms that the fitted model is appropriate: the residuals and the fitted values are uncorrelated. The Normal Q-Q plot shows no evidence of major departure from the line, suggesting the residuals are approximately normally distributed and not skewed. Residuals vs Leverage shows that no data points fall outside Cook’s distance, so there are no unduly influential individual data points. Everything checks out.


Chapter 3: Logistic regression

This week was all about the title. The statistics were interesting and the tools powerful. Should be useful in the future. The only problem I continuously have with R is that every function seems to be like a “magic box”. I just put in some numbers, and it does its magic in secret, spewing out magnificent graphs and whatnot, but I always feel like I have no clue about, or maybe even control over, how it does what it does. Like driving with an autopilot: just enter the destination, and the car takes care of everything. Can I say that I drove there? Or even that I can drive, if all I do is input destinations? Each function works differently; this seems not so much about learning general R syntax, but about learning which function does what, and what you need to feed it. Perhaps the feeling arises because I have previously done everything step by step myself, each graph and so on. This feels almost like cheating at times, and on the other hand like giving away control.

date()
## [1] "Mon Nov 28 22:52:33 2022"

1. Create the chapter3.Rmd file in data wrangling and include it as a child of index.Rmd

Done.

Also, loading all the needed libraries here.

library(scales); library(dplyr); library(tidyverse); library(ggplot2); library(boot)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor

2. Read the data into R.

3. Choose 4 interesting variables in the data

for each of them, present your personal hypothesis about their relationships with alcohol consumption. (0-1 point)

The 4 variables I have chosen to analyse in relation to alcohol consumption are:

  1. Absences (variable name: absences; numeric, from 0 to 45)
  • hypothesis: correlates positively with high alcohol consumption
  2. Going out with friends (variable name: goout; numeric: from 1 - very low to 5 - very high)
  • hypothesis: correlates positively with high alcohol consumption
  3. Extra-curricular activities (variable name: activities; binary: yes or no)
  • hypothesis: correlates negatively with high alcohol consumption
  4. Weekly study time (variable name: studytime; numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
  • hypothesis: correlates negatively with high alcohol consumption

Going out a lot with friends would probably correlate most strongly with high alcohol consumption. High alcohol consumption could also cause absences from classes. The last two, on the other hand, would limit one’s ability to partake in sociable drinking.

4. Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption

Use for example cross-tabulations, bar plots and box plots. Comment on your findings and compare the results of your exploration to your previously stated hypotheses. (0-5 points)

This will be a “preliminary” probe, with a brief overview of each chosen variable.

4.1 Absences

Absences is a numeric count of absences, and high_use is a binary. Any correlation should be visible from a simple boxplot.

plot_absences <- ggplot(alc, aes(x = high_use, y = absences))
plot_absences + geom_boxplot() + ggtitle("1. Amount of absences vs. high use of alcohol")

The hypothesis seems to hold, and the difference between the groups looks substantial.

4.2 Going out with friends

Going out was graded on a scale from 1 to 5, and high_use is a binary. A barplot should reveal if there are any differences between the two groups of high-users and low-users.

plot_goout <- ggplot(alc, aes(x = goout))
plot_goout + geom_bar() + facet_wrap("high_use", labeller = label_both) + ggtitle("2. going out with friends vs. high alcohol usage") 

As the barplot shows, the high users do go out a lot more than the low users: their distribution leans heavily to the right. A simple check of mean values and distribution percentages confirms the hypothesis.

alc %>% group_by(high_use) %>% summarise(count = n(), mean_goout = mean(goout))
## # A tibble: 2 × 3
##   high_use count mean_goout
##   <lgl>    <int>      <dbl>
## 1 FALSE      259       2.85
## 2 TRUE       111       3.73
alc %>% group_by(high_use, goout) %>% summarise(count = n()) %>% ungroup() %>% group_by(high_use) %>% mutate(percentage = percent(count/sum(count)))
## `summarise()` has grouped output by 'high_use'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 4
## # Groups:   high_use [2]
##    high_use goout count percentage
##    <lgl>    <dbl> <int> <chr>     
##  1 FALSE        1    19 7.34%     
##  2 FALSE        2    82 31.66%    
##  3 FALSE        3    97 37.45%    
##  4 FALSE        4    40 15.44%    
##  5 FALSE        5    21 8.11%     
##  6 TRUE         1     3 2.7%      
##  7 TRUE         2    15 13.5%     
##  8 TRUE         3    23 20.7%     
##  9 TRUE         4    38 34.2%     
## 10 TRUE         5    32 28.8%

4.3 Extra-curricular activities

Extra-curricular activity is a binary value, as is high_use. A simple cross-tabulation of the counts should give us an overview.

alc %>% group_by(high_use, activities) %>% summarise(count = n()) %>% ungroup() %>% group_by(high_use) %>% mutate(percentage = percent(count/sum(count)))
## `summarise()` has grouped output by 'high_use'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 4
## # Groups:   high_use [2]
##   high_use activities count percentage
##   <lgl>    <chr>      <int> <chr>     
## 1 FALSE    no           120 46.3%     
## 2 FALSE    yes          139 53.7%     
## 3 TRUE     no            59 53.2%     
## 4 TRUE     yes           52 46.8%

Perhaps surprisingly, there is no notable difference in extra-curricular activities between the two groups; both essentially break even. Further inquiries into this variable would be futile.

4.4 Studying time

Weekly study time is a numeric field (values: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours). A barplot should give an overview of any correlation.

plot_studytime <- ggplot(alc, aes(x = studytime))
plot_studytime + geom_bar() + facet_wrap("high_use", labeller = label_both) + ggtitle("4. Weekly study time vs. high alcohol usage") 

alc %>% group_by(high_use, studytime) %>% summarise(count = n()) %>% ungroup() %>% group_by(high_use) %>% mutate(percentage = percent(count/sum(count)))
## `summarise()` has grouped output by 'high_use'. You can override using the
## `.groups` argument.
## # A tibble: 8 × 4
## # Groups:   high_use [2]
##   high_use studytime count percentage
##   <lgl>        <dbl> <int> <chr>     
## 1 FALSE            1    56 21.6%     
## 2 FALSE            2   128 49.4%     
## 3 FALSE            3    52 20.1%     
## 4 FALSE            4    23 8.9%      
## 5 TRUE             1    42 37.8%     
## 6 TRUE             2    57 51.4%     
## 7 TRUE             3     8 7.2%      
## 8 TRUE             4     4 3.6%

As can be seen, almost 90% of high alcohol users studied less than 5 hours per week, whereas low users are spread more evenly, with considerably more time spent on studies. The correlation is strong here.

5. Use logistic regression to statistically explore the relationship between your chosen variables and the binary high/low alcohol consumption variable as the target variable.

Present and interpret a summary of the fitted model. Present and interpret the coefficients of the model as odds ratios and provide confidence intervals for them. Interpret the results and compare them to your previously stated hypothesis. (0-5 points)

Okey dokey. First the code, then the analysis.

m <- glm(high_use ~ absences + goout  + activities + studytime, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ absences + goout + activities + studytime, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9196  -0.7732  -0.5083   0.8446   2.5676  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.34944    0.53871  -4.361 1.29e-05 ***
## absences       0.06961    0.02213   3.145  0.00166 ** 
## goout          0.73572    0.11959   6.152 7.64e-10 ***
## activitiesyes -0.31151    0.25661  -1.214  0.22476    
## studytime     -0.56049    0.16914  -3.314  0.00092 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 375.68  on 365  degrees of freedom
## AIC: 385.68
## 
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                       OR      2.5 %    97.5 %
## (Intercept)   0.09542238 0.03227113 0.2680871
## absences      1.07208689 1.02777425 1.1223836
## goout         2.08697580 1.66084887 2.6570839
## activitiesyes 0.73234019 0.44132598 1.2094464
## studytime     0.57093162 0.40522569 0.7880221

Statistical significance:

Absences, going out and study time are statistically significant (p-values 0.00166, 7.64e-10 and 0.00092, respectively). Activities, on the other hand, is statistically non-significant, as noted earlier. The preliminary hypotheses also correctly predicted the direction of the correlation for the three significant variables.

The odds ratios and confidence intervals:

The OR and CI show that each additional absence corresponds to about a 7 percent increase in the odds of belonging to the high-use group, with the population value falling between 2.8 and 12.2 percent. Similarly, belonging to one step higher a studytime group means about 43 percent lower odds of being a high user, with population values between a 21.2 and 59.5 percent reduction. One step higher in the going out variable, on the other hand, means a whopping 108 percent increase in the odds of being a high alcohol user, with population values between 66 and 166 percent. The CI of activities includes 1, which means that in the population its effect could go in either direction, so nothing can be said of it.
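As a sanity check, the percentage changes quoted above follow directly from the odds ratios (values copied from the table above; a negative result means reduced odds):

```r
# Percent change in the odds of high use per one-unit increase
or <- c(absences = 1.0721, goout = 2.0870, studytime = 0.5709)
round((or - 1) * 100, 1)  # absences +7.2, goout +108.7, studytime -42.9
```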

6. Using the variables which, according to your logistic regression model, had a statistical relationship with high/low alcohol consumption, explore the predictive power of your model.

Provide a 2x2 cross tabulation of predictions versus the actual values and optionally display a graphic visualizing both the actual values and the predictions. Compute the total proportion of inaccurately classified individuals (= the training error) and comment on all the results. Compare the performance of the model with performance achieved by some simple guessing strategy. (0-3 points)

We’ll use the statistically significant variables found above: absences, goout and studytime.

m2 <- glm(high_use ~ absences + goout + studytime, data = alc, family = "binomial")

probabilities <- predict(m2, type = "response")

# add the predicted probabilities to 'alc2'
alc2 <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc2 <- mutate(alc2, prediction = probability > 0.5)

# tabulate the target variable versus the predictions
table(high_use = alc2$high_use, prediction = alc2$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   239   20
##    TRUE     63   48
g <- ggplot(alc2, aes(x = probability, y = high_use))

# define the geom as points and draw the plot
g + geom_point(aes(col = prediction))

# tabulate the target variable versus the predictions
table(high_use = alc2$high_use, prediction = alc2$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.64594595 0.05405405 0.70000000
##    TRUE  0.17027027 0.12972973 0.30000000
##    Sum   0.81621622 0.18378378 1.00000000
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc2$high_use, prob = alc2$probability)
## [1] 0.2243243

The loss function gives 22.4% falsely predicted individuals. The false positives and false negatives in the cross-tabulation give 17.0% + 5.4% = 22.4% falsely predicted individuals, the same number. This is the total proportion of inaccurately classified individuals (= the training error).
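The same training error can be read straight off the raw confusion matrix above, as the off-diagonal (misclassified) cells over all observations:

```r
# False positives (20) + false negatives (63) over all 370 observations
(20 + 63) / (239 + 20 + 63 + 48)  # 0.2243243
```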

7. Bonus: Perform 10-fold cross-validation on your model.

Does your model have better test set performance (smaller prediction error using 10-fold cross-validation) compared to the model introduced in the Exercise Set (which had about 0.26 error). Could you find such a model? (0-2 points to compensate any loss of points from the above exercises)

cv <- cv.glm(data = alc2, cost = loss_func, glmfit = m2, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2351351

It seems this model has better test-set performance (error ≈ 0.235) than the model introduced in the Exercise Set (≈ 0.26).

8. Super-Bonus: NOT DONE


Chapter 4: Clustering and classification

Reflections: This week was horribly busy for me, and I also fell ill. I barely had time to do the exercises and the assignments. I did skim through Kimmo’s book chapters, but I must say that the whole concept of how clustering works seems a bit …vague? Yes, it works, but it seems that every method involves either or both of 1) the researcher guessing where to cut the clusters, and 2) a random factor in the clustering such that the results can only be reproduced by fixing the random seed. Which sounds rather strange, and does not make one feel that the process can even be firmly grasped. But I will try my best. As mentioned in the book, clustering is a basic concept of reasoning, and an absolutely necessary tool for data analysis.

date()
## [1] "Mon Nov 28 22:52:35 2022"

1. Create chapter4.rmd -file etc.

Done.

# I will load all the needed libraries and the dataset here for convenience

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(tidyverse)
library(corrplot)
## corrplot 0.92 loaded

2. Load data, explore and describe it

data("Boston")

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14
glimpse(Boston)
## Rows: 506
## Columns: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…

The Boston dataset is a classic dataset distributed with R as part of the MASS package. Its title, “Housing Values in Suburbs of Boston”, describes its origins well enough. The dataset was gathered for the 1978 article “Hedonic prices and the demand for clean air” (J. Environ. Economics and Management 5, 81–102) by D. Harrison and D.L. Rubinfeld. The 14 variables map factors that affected housing prices in the suburbs (called “towns” in the documentation; we’ll use the same term henceforth) around Boston, and include information on - for instance - crime rate, the concentration of nitrogen oxides in the air, and access to Boston’s radial highways. The data is already clean; it contains 506 rows of observations and 14 columns of variables.

The variables are:

  • crim: per capita crime rate by town.
  • zn: proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus: proportion of non-retail business acres per town.
  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  • nox: nitrogen oxides concentration (parts per 10 million).
  • rm: average number of rooms per dwelling.
  • age: proportion of owner-occupied units built prior to 1940.
  • dis: weighted mean of distances to five Boston employment centres.
  • rad: index of accessibility to radial highways.
  • tax: full-value property-tax rate per $10,000.
  • ptratio: pupil-teacher ratio by town.
  • black: \(1000(Bk - 0.63)^2\) where \(Bk\) is the proportion of blacks by town.
  • lstat: lower status of the population (percent).
  • medv: median value of owner-occupied homes in $1000s.

3. Summary of variables and graphical overview

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
cor_matrix <- cor(Boston) %>% round(2)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

Summary

Many of the distributions look quite natural, as with the airborne NOx concentrations. Yet some of the variables are uneven. The crim variable, for instance, ranges from 0.006 to 88.976, with a median of 0.26 and a mean of 3.61 - it seems there is huge variation (and a lot of outliers) in crime between towns, with a few of them accounting for the majority of the cases. Non-retail business, on the other hand, is divided more evenly, but wealth is clearly unevenly divided (lstat and medv). The transformed black variable also seems to have a lot of outliers (min: 0.32, 1st qu.: 375.38, mean: 356.67). Some variables seem to be interconnected, judging by their similar spreads (for instance dis, rad, zn).

Plot

As can be seen, some of the values mentioned above are interconnected. Lower status (lstat) and the median value of homes are negatively correlated to a very high degree. The number of rooms correlates positively with more expensive homes, as can be expected. Interestingly, the dis variable (distance to employment centres) reveals the societal structure and character of the economic centers: older housing and smaller lots on the zoned land, more (non-retail) businesses that bring more customers and employees commuting daily by car, and hence more nitrogen oxides in the air; also more crime and more lower-status inhabitants. This dynamic works as a sort of heuristic key for reading the data. As another instance of the same phenomenon, the rad variable (“access to radial highways”) connects all the data points pertinent to the same dynamic: highways connect major centers. Crime rate is strongly connected with this variable for this reason.

As an additional check, I will draw boxplots of some of the variables mentioned in the summary analysis. The number of outliers makes it clear that we are dealing with data that is not evenly spread and should be checked for clusters. As a preliminary hypothesis, we might say that the clusters would most likely form around the dynamic of populous, dense commercial towns vs. smaller towns with much less crime, pollution, commerce, etc.

par(mfrow = c(1, 5))
boxplot(Boston$crim, main = "crim", col = "purple", outcol = "red")
boxplot(Boston$zn, main = "zn", col = "blue", outcol = "red")
boxplot(Boston$lstat, main = "lstat", col = "yellow", outcol = "red")
boxplot(Boston$medv, main = "medv", col = "orange", outcol = "red")
boxplot(Boston$black, main = "black", col = "green", outcol = "red")

4. Standardizing, creating a categorical variable and the “train” and “test” datasets

Next, we will standardize the dataset and print out a summary of the scaled data. Scaling centres each variable to a mean of 0 and divides it by its standard deviation, so all variables end up on a comparable scale.

boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
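As a sanity check on what scale() actually does, we can reproduce one standardized column by hand: it subtracts the column mean and divides by the column standard deviation. A minimal sketch:

```r
library(MASS)
data(Boston)
# standardize crim manually and compare with scale()'s result
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
scaled <- as.data.frame(scale(Boston))
all.equal(manual, scaled$crim)  # TRUE
```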

Then, as mentioned in the chapter title, we will create a categorical variable to replace the old continuous crime rate with, and then create the training and testing datasets.

# cut the (already numeric) scaled crime rate into quartile-based classes
bins <- quantile(boston_scaled$crim)
labels <- c("low", "med_low", "med_high", "high")
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = labels)
# replace the continuous crim with the new categorical crime
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
summary(boston_scaled)
##        zn               indus              chas              nox         
##  Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723   Min.   :-1.4644  
##  1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723   1st Qu.:-0.9121  
##  Median :-0.48724   Median :-0.2109   Median :-0.2723   Median :-0.1441  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723   3rd Qu.: 0.5981  
##  Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648   Max.   : 2.7296  
##        rm               age               dis               rad         
##  Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658   Min.   :-0.9819  
##  1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049   1st Qu.:-0.6373  
##  Median :-0.1084   Median : 0.3171   Median :-0.2790   Median :-0.5225  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617   3rd Qu.: 1.6596  
##  Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566   Max.   : 1.6596  
##       tax             ptratio            black             lstat        
##  Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033   Min.   :-1.5296  
##  1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049   1st Qu.:-0.7986  
##  Median :-0.4642   Median : 0.2746   Median : 0.3808   Median :-0.1811  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332   3rd Qu.: 0.6024  
##  Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406   Max.   : 3.5453  
##       medv              crime    
##  Min.   :-1.9063   low     :127  
##  1st Qu.:-0.5989   med_low :126  
##  Median :-0.1449   med_high:126  
##  Mean   : 0.0000   high    :127  
##  3rd Qu.: 0.2683                 
##  Max.   : 2.9865
# split the data: 80 % of the rows to train, the rest to test
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind, ]
test <- boston_scaled[-ind, ]

5. LDA on the train set

Next we’ll fit the linear discriminant analysis on the train set. Categorical crime rate is the target variable, with the other variables as predictors.

# lda() comes from the MASS package
library(MASS)
lda.fit <- lda(crime ~ ., data = train)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

The variable rad has the highest correlation with the crime rate, and it is mainly through it that we are able to cluster the towns with the highest crime rates. The second and third most important factors are nox and zn, which basically differentiate the towns by their population density.
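These statements about the predictors can be checked from the coefficient matrix lda.fit$scaling, whose exact values depend on the random train/test split. A self-contained sketch that fits the same model on the full scaled data, just for illustration:

```r
library(MASS)
data(Boston)
bs <- as.data.frame(scale(Boston))
# same quartile-based crime classes as above
bs$crime <- cut(bs$crim, breaks = quantile(bs$crim), include.lowest = TRUE,
                labels = c("low", "med_low", "med_high", "high"))
bs$crim <- NULL
fit <- lda(crime ~ ., data = bs)
# predictors ordered by the magnitude of their LD1 coefficient
sort(abs(fit$scaling[, "LD1"]), decreasing = TRUE)
```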

6. Predicting with LDA on the test data

Next we’ll use the test set to predict the crime rate classes and cross-tabulate the predictions with the actual classes. First we’ll save the correct classes as their own variable and remove them from the test set.

correct_classes <- test$crime
test <- dplyr::select(test, -crime)

lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       18       8        0    0
##   med_low    2      20        4    0
##   med_high   0      10       14    2
##   high       0       0        0   24

As can be seen, the model predicts most of the cases correctly, with more accuracy at the low and high ends of the spectrum.
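The overall accuracy can be computed from the diagonal of the confusion matrix; a quick sketch using the counts from this particular run (they will vary with the random split):

```r
# correct predictions sit on the diagonal of the cross-tabulation
tab <- matrix(c(18,  8,  0,  0,
                 2, 20,  4,  0,
                 0, 10, 14,  2,
                 0,  0,  0, 24),
              nrow = 4, byrow = TRUE)
sum(diag(tab)) / sum(tab)  # ~0.75, i.e. 76 of 102 correct
```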

7. K-means clustering

Following the assignment, we’ll first reload the dataset and standardize it again to bring the means to zero.

data(Boston)
boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

Next, we’ll compute the Euclidean distances between the observations and check out a summary of the values.

dist_eu <- dist(boston_scaled)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
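To make the distance measure concrete: dist() computes the pairwise Euclidean distances between the rows, i.e. the square root of the summed squared differences. A tiny worked example:

```r
# sqrt((4-1)^2 + (6-2)^2) = sqrt(9 + 16) = 5
m <- rbind(c(1, 2), c(4, 6))
dist(m)  # 5
```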

Next we’ll run a k-means algorithm on the dataset.

set.seed(13)

km <- kmeans(boston_scaled, centers = 4)

pairs(boston_scaled, col = km$cluster)

Four clusters seems like a lot. Let’s try to gauge the optimal number of clusters via a WCSS analysis (total within-cluster sum of squares). A sharp bend in the graph should indicate a good value for the cut.

set.seed(123)

# compute the total WCSS for 1 to 10 clusters on the scaled data
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.

Optimal point for clustering seems to be 2. Let’s rerun the k-means analysis with 2 clusters.

set.seed(13)

km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)

Not very informative due to the small plot size. Let’s take a closer look at the variables we noted as correlated earlier: crim, zn, nox, dis, rad, lstat and medv.

# columns 1, 2, 5, 8, 9, 13 and 14: crim, zn, nox, dis, rad, lstat, medv
pairs(boston_scaled[, c(1, 2, 5, 8, 9, 13, 14)], col = km$cluster)

Some of these might still be dropped, but in the main it seems that these two clusters denote meaningful and significant differences between two groups of observations in the data. The commercial and economic centres have smaller lots, poorer air quality, better access to ring roads, more poor people, and less expensive homes. Further analysis would be needed to elaborate.

DONE

Bonus: NOT DONE

Just not enough time. Being ill with a small baby has its drawbacks.

Super-Bonus: FIRST PART DONE

Yet, this seems intriguing. Who doesn’t want more dimensions? Gotta try, I’ll give it “15 minutes”, and just list the instructions here as I go through them. Not for the points, but for the coolness factor of 3D thingies.

Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
Next, install and access the plotly package.
# install.packages("plotly")
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
Create a 3D plot (cool!) of the columns of the matrix product using the code below.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')

#WOW

Adjust the code: add argument color as a argument in the plot_ly() function. Set the color to be the crime classes of the train set.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=classes)

It is beautiful. Oh wait, I can move it! C O O L

Draw another 3D plot where the color is defined by the clusters of the k-means.
# plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=km$cluster)

Okey dokey. This produces an error - probably because km$cluster is not part of the data matrix we created: km was run on the full scaled dataset, so km$cluster has 506 labels, while matrix_product only contains the 404 training rows. I should subset the cluster labels with the same index used for the train/test split (or join them to the train set) and then create the data matrix and the 3D plot again… but this time I think I will just go to bed and sleep.
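For the record, a sketch of the fix described above (untested, and assuming the objects matrix_product, km and ind from the earlier chunks are still in the environment): km was run on all 506 rows, so its cluster labels need to be subset with the same index that produced the train set.

```r
# subset the k-means cluster labels to the training rows before plotting
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = 'scatter3d', mode = 'markers', color = km$cluster[ind])
```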

Peace out.


(more chapters to be added similarly as we proceed with the course!)